initialization method
- Asia > Singapore (0.05)
- North America > United States (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
LoRA-GA: Low-Rank Adaptation with Gradient Approximation
Fine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs. LoRA, as one of the most popular Parameter-Efficient Fine-Tuning (PEFT) methods, offers a cost-effective alternative by fine-tuning an auxiliary low-rank model that has significantly fewer parameters. Although LoRA reduces the computational and memory requirements significantly at each iteration, extensive empirical evidence indicates that it converges at a considerably slower rate compared to full fine-tuning, ultimately leading to increased overall compute and often worse test performance. In our paper, we perform an in-depth investigation of the initialization method of LoRA and show that careful initialization (without any change of the architecture and the training algorithm) can significantly enhance both efficiency and performance. In particular, we introduce a novel initialization method, LoRA-GA (Low Rank Adaptation with Gradient Approximation), which aligns the gradients of low-rank matrix product with those of full fine-tuning at the first step. Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (hence being significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance. For example, on the subset of the GLUE dataset with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, LoRA-GA shows performance improvements of 0.34, 11.52%, and 5.05% on MTbench, GSM8k, and Human-eval, respectively. Additionally, we observe up to 2-4 times convergence speed improvement compared to vanilla LoRA, validating its effectiveness in accelerating convergence and enhancing model performance.
How to Initialize your Network? Robust Initialization for WeightNorm & ResNets
Residual networks (ResNet) and weight normalization play an important role in various deep learning applications. However, parameter initialization strategies have not been studied previously for weight normalized networks and, in practice, initialization methods designed for un-normalized networks are used as a proxy. Similarly, initialization for ResNets have also been studied for un-normalized networks and often under simplified settings ignoring the shortcut connection. To address these issues, we propose a novel parameter initialization strategy that avoids explosion/vanishment of information across layers for weight normalized networks with and without residual connections. The proposed strategy is based on a theoretical analysis using mean field approximation. We run over 2,500 experiments and evaluate our proposal on image datasets showing that the proposed initialization outperforms existing initialization methods in terms of generalization performance, robustness to hyper-parameter values and variance between seeds, especially when networks get deeper in which case existing methods fail to even start training. Finally, we show that using our initialization in conjunction with learning rate warmup is able to reduce the gap between the performance of weight normalized and batch normalized networks.
A comparison between initialization strategies for the infinite hidden Markov model
Cortese, Federico P., Rossini, Luca
Infinite hidden Markov models provide a flexible framework for modelling time series with structural changes and complex dynamics, without requiring the number of latent states to be specified in advance. This flexibility is achieved through the hierarchical Dirichlet process prior, while efficient Bayesian inference is enabled by the beam sampler, which combines dynamic programming with slice sampling to truncate the infinite state space adaptively. Despite extensive methodological developments, the role of initialization in this framework has received limited attention. This study addresses this gap by systematically evaluating initialization strategies commonly used for finite hidden Markov models and assessing their suitability in the infinite setting. Results from both simulated and real datasets show that distance-based clustering initializations consistently outperform model-based and uniform alternatives, the latter being the most widely adopted in the existing literature.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Romania (0.04)
- (10 more...)
- Energy (0.68)
- Banking & Finance > Economy (0.68)
Neural network initialization with nonlinear characteristics and information on spectral bias
Initialization of neural network parameters, such as weights and biases, has a crucial impact on learning performance; if chosen well, we can even avoid the need for additional training with backpropagation. For example, algorithms based on the ridgelet transform or the SWIM (sampling where it matters) concept have been proposed for initialization. On the other hand, it is well-known that neural networks tend to learn coarse information in the earlier layers. The feature is called spectral bias. In this work, we investigate the effects of utilizing information on the spectral bias in the initialization of neural networks. Hence, we propose a framework that adjusts the scale factors in the SWIM algorithm to capture low-frequency components in the early-stage hidden layers and to represent high-frequency components in the late-stage hidden layers. Numerical experiments on a one-dimensional regression task and the MNIST classification task demonstrate that the proposed method outperforms the conventional initialization algorithms. This work clarifies the importance of intrinsic spectral properties in learning neural networks, and the finding yields an effective parameter initialization strategy that enhances their training performance.
Token Distillation: Attention-aware Input Embeddings For New Tokens
Dobler, Konstantin, Elliott, Desmond, de Melo, Gerard
New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. This excessive tokenization not only leads to reduced performance on downstream tasks (Rust et al., 2021; Ali et al., 2024) but also increases the computational Although adding new tokens to a model's vocabulary can reduce over-tokenization, it Whenever we wish to add a new token to a pretrained model's vocabulary, this new token may The semantics of a word composed of multiple subtokens will largely not be stored in their raw input embeddings at all - but rather constructed by the Transformer's attention/feed-forward layer stack during contextualization (Elhage et al., 2022; Lad et al., 2024; We demonstrate the efficacy of our method, dubbed "Token Distillation", in Section 5. We illustrate Our experimental setup is detailed in Section 4. In summary, our contributions are as follows. We motivate our proposed method by describing the fundamental limitations of current embedding initialization methods and empirically verify our claims. Most state-of-the-art Large Language Models (LLMs) are trained using a static tokenizer, usually derived by a byte-pair encoding scheme before model training (Sennrich et al., 2016). Furthermore, Lesci et al. (2025) show that in practice, words which are not a single A solution to this problem is to modify the existing vocabulary to suit the specific needs.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- (16 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Education (0.93)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.67)